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| Abstract. New algorithms are proposed for the Tucker approximation of 

a 3-tensor, that access it using only the tensor-by-vector-by-vector multiplication 

^ subroutine. In the matrix case, Krylov methods are methods of choice to approx- 

imate the dominant column and row subspaces of a sparse or structured matrix 
given through the matrix-by-vector multiplication subroutine. Using the Wedder- 
burn rank reduction formula, we propose an algorithm of matrix approximation that 
computes Krylov subspaces and allows generalization to the tensor case. Several 

CSl variants of proposed tensor algorithms differ by pivoting strategies, overall cost and 

quality of approximation. By convincing numerical experiments we show that the 
proposed methods are faster and more accurate than the minimal Krylov recursion, 
proposed recently by Elden and Savas. 
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a 1 Introduction 



In this paper we focus on algorithms for the low-rank approximation of large three- 
dimensional arrays (tensors), that play increasingly important role in many applications. 
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Throughout the paper, a tensor A = [a^J means an array with three indices. The 
indices are also called modes or axes. The number of allowed values for a mode index 
is called the mode size. In numerical work with tensors of large mode sizes it is crucial 
to look for the data sparse structures. Most common are the following. 

The canonical decomposition [18, 3, 17] (or canonical approximation, if the right- 
hand side does not give A exactly) of a tensor A = [ayjj reads 

R R 

A = ^u s ®v s ^w s , a ijk = ^u is v js w ks . (C) 

s=1 s=1 

The minimal possible number of summands is called the tensor rank or the canonical 
rank of a given tensor A. However, the canonical decomposition/approximation of a 
tensor with minimal value of R is an ill-posed and computationally unstable problem 
[8], and among several practical algorithms none is known to be absolutely robust. 
The (truncated) Tucker decomposition/ approximation [35] of A reads 

1"1 T 2 T 3 

A = G x, U x 2 V x 3 W, a ijk = g pqs u lp v jq w ks . (T) 

p=1 q=l s=1 

The quantities r 1 ,T2,r 3 are referred to as the Tucker ranks or the mode ranks, the 
T] x T2 x r 3 tensor G = [g pqs ] is called the Tucker core. In d dimensions, the memory 
to store the r x r x . . . x r core is r d , that is usually beyond affordable for large d and 
modest r (so-called curse of dimensionality). In three dimensions for r ~ 100 the 
storage for the core is small and the Tucker decomposition can be used efficiently 

Here and below, the symbol x t designates the multiplication of a tensor by a matrix 
along the l-th mode. For example, B = A x 2 M means summation on 2 nd index = 
Y_y TTLjj'O-ij'ic. The notation for tensor operations is not yet standard (for current state 
of art see review [21]), we use the one proposed by de Lathauwer in the article on 
multilinear SVD [6] (or higher-order SVD, HOSVD). In [6] the Tucker format arises 
with additional constraints: orthogonality of factors, that is also assumed in our paper, 
and all-orthogonality of the core, that we will relax. For a tensor given as a full array 
of elements, the multilinear SVD provides a quasi-optimal Tucker approximation which 
further can be refined by different iterative methods such that Tucker-ALS [22, 7], 
Newton- Grassmann [10, 19], etc. It reduces to the SVD for three unfolding matrices 
(see [6] and (13) later in this paper) and costs 0(n 4 ) operations for U] x ri2 x n$ tensor. $ 
It is clearly too much for large n, and we should look for alternatives. 

In the matrix case two classes of fast methods are popular: cross algorithms and 
Krylov-based approaches. Cross methods compute rank-r approximation by interpo- 
lating the Ut x n 2 matrix on a cleverly chosen set of crosses, proposed for instance 
in [36, 15]. That requires Q(nr) evaluations of matrix elements and can be used for ma- 
trices implicitly given by a subroutine that evaluates any prescribed element. However, 
the verification of the approximation requires several heuristics. If the matrix is struc- 
tured (sparse, Toeplitz or Hankel, low-rank, sum and/or product of above, etc.) and a 

$ We always assume ni = 712 = TI3 = tl and ri = T2 = r$ — r in complexity estimates 
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fast matrix-by-vector product is available, then the Krylov-type methods are methods 
of choice with established convergence and complexity estimates. 

The generalization of the cross method to 3-tensors (Cross3D, [25]) requires the 
careful algorithmic implementation and several tweaks to make it really efficient. Gen- 
erally, the Cross3D interpolates a tensor on 0(ur + r 3 ) elements and uses 0(ur 2 + r 4 ) 
additional operations. However, the accuracy check also requires heuristics. 

Recently, Elden and Savas [30] proposed two generalizations of Krylov methods to 
tensors: minimal and maximal Krylov recursion. The minimal recursion requires only 3r 
tensor-by- vector-by- vector multiplications to compute basis sets U, V, W for the Tucker 
rank-(r, r, r) approximation (T), but sometimes it converges slowly and even does not 
converge (for example, for tensors with sufficiently different mode ranks). The maximal 
recursion is convergent, but needs definitely unaffordable number of tensor operations. 
The goal of this paper is to propose algorithms that use tensor only through the tensor- 
by-vector-by-vector multiplication subroutine, have asymptotical complexity equal to 
the one of minimal Krylov recursion whereas have better convergence properties. We 
also try to keep a link with the existing theory of matrix approximation and with ideas 
used in [25] for the cross approximation of tensors. 

This paper is organized as follows. In the section 2 we propose the optimization 
of minimal Krylov recursion, inspired by the idea of maximization of the orthogonal 
component of new vectors. In the section 3 we recall the Wedderburn rank reduction 
formula that gives a nice framework for many matrix factorizations [5]. Then we propose 
a variant of a matrix decomposition that is similar to the Gaussian elimination with 
pivoting but uses the matrix through the matrix-by-vector products. In the section 4 
we generalize this method to the tensor case. The Wedderburn elimination process 
gives some freedom in a selection of the vectors to eliminate, and we use it proposing 
several 'pivoting' strategies, that lead to different complexity estimates and convergence 
properties. Some pivoting strategies turn the proposed method into the minimal Krylov 
recursion and the optimized minimal Krylov recursion. In the section 5 we apply the 
proposed family of algorithms for the approximation of different structured tensors and 
give numerical comparison with previous methods. 

The tensor-by-vector-by-vector multiplication, shortly the tenvec operation, can 
be defined via the tensor-by-matrix products. For example, tenvec of ni x n 2 x n 3 tensor 
A = [omj at modes 2, 3 reads 

u = A x 2 v T x 3 w T , ut = aij k VjW k 

j=l Jc=1 

where v, w are vectors of length n 2 and u 3 respectively and the result is a vector u of 
length ni . This is the solely tensor operation in algorithms in this paper and we propose 
extremely simple notation 

a T T def - . t t def - « T T def - 

A x 2 v x 3 w = Avw, A x 3 w X]U = Awu, AX]U x 2 v = Auv, 
relaxing the information on contraction modes, that is always obvious from the symbolic 
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notation and mode sizes of vectors. This simple notation (that generalizes the Tdd 
notation [2]) reveals the analogy of tensor algorithms with the matrix case. 

We consider the approximation of matrices and tensors in the Frobenius norm 



n, n 2 n 3 




2 def 



(A, A) , (A, B) = Z Z Z 



i=i j=i k=i 



For theoretical estimates we also use the spectral norm of tensor (cf. [9]) 




max (, 

||x|| = ||y|H|z||=l 



A, x ® y si z) , 



induced by the standard vector norm ||x|| 2 = ||x||2 = (x, x) = ^{Li l x il 2 - 

It is worth to mention that discussed algorithms first aim to construct the orthogonal 
bases U, V and W of the dominant mode subspaces, and the core G of the Tucker 
approximation can be computed afterwards. For the fixed U, V, W the most accurate 
approximation in Frobenius norm is obtained with the core 



In general, (1) can be computed using r 2 tenvecs, but faster implementations are avail- 
able for certain structured tensors and can essentially improve the complexity. We can 
also be satisfied with the sub-optimal but fast formula for G, for example the one based 
on interpolation on a maximum- volume set of indices [14] as proposed in [26]. If the 
computation of (1) is not a necessary part of the discussed algorithm, this cost is not 
included in the total complexity. 

2 Minimal Krylov recursion and its optimization 

The generalization of the Krylov subspace method to the problem of tensor approxima- 
tion was first proposed by Elden and Savas in [30]. Two variants are discussed, namely 
the minimal and the maximum Krylov recursion. 

The minimal Krylov recursion (MKR, see Alg. 1) requires 3 tenvecs on each iteration, 
therefore, basis sets U = U T , V = V T , W = W r are computed in 3r tenvecs. However, 
their 'quality', i.e. the accuracy of the approximation A = Gx 1 Ux 2 Vx 3 W with the 
optimal core (1) may be low, and for some cases ||A — A||f will not reduce at all, even 
for large ranks. 

As an example, consider an n x n x n tensor A with only two non-zero slices* 



Obviously, if Ai ^ A 2 , the mode-3 subspace of A consists of two vectors W = [e-\ e 2 ]. 
Starting the MKR Alg. 1 from some Ui,v-|, we accumulate W 2 = [eiej in two steps. 

$ Here and after we use the MATLAB-style notation, with ":" denoting all possible index values 



G=Ax,U T x 2 V T x 3 W T . 



(1) 



A(:,:,1)=Ai 



A(:,:,2)=A 2 . 
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Algorithm 1: [30] Minimal Krylov recursion for tensor approximation (MKR) 
Input: Tenvec subroutine for tensor A 

Output: Mode subspaces U, V, W for the Tucker approximation 
Initialization: Unit vectors ui , vi 

w, := Au^/HAuivJ, Ut = [u,],Vi = [vi],Wi = [w{\ 

for k = 1,2, ... do 

u:=Av k w k ; u' := (I - U k U^)u; u k+1 :=u'/||u'||; U k+1 :=[U k u k+1 ] 
v := Aw k u k+1 ; v' := (I — V k V k )v; v k+ i ■= v' /\\v'\\; V k+J := [V k v k+1 ] 
w := Au k+1 v k+1 ; w' := (I-W k W£)w; w k+1 :=w'/||w'||; W k+1 := [W k w k+1 ] 
end for 



Then all the computed vectors w have zero component w' orthogonal to W 2 . This 
situation is referred to as breakdown, since we cannot continue the process. 

In [30] Elden and Savas propose to fix breakdowns by taking an orthogonal vector 
to the subspace W. However with W3 := e$ we come to zero vector u := AV3W3 on 
the next iteration and can not continue the process. Another possible workaround, 
proposed in [31] is to use the last basis vector on all the subsequent iterations, setting 
W3 := W2,W4 := W2 and so on. This should be done when the subspace of the mode- 
3 vectors of A equals to spanW k . However, it is difficult to check this fact using only 
tenvec operations and it is not clear how to do this in the non-exact case which is always 
true in the machine precision arithmetic. In the numerical examples (see section 5.2) 
we show that the poor convergence and even the stagnation can occur also for 'more 
practical' situations. 

Prom this discussion we derive the idea of optimization of the Alg. 1. Consider the 
step for u and generalize it as follows. 

u := A(V k v k )(W k w k ); u' := (I - U k U^)u; u k+1 := u'/||u'||; U k+1 := [U k u k+1 ]. 

New direction u is generated by the tenvec of A with some vector from spanV and 
some vector from spanW. In MKR we always take v k = w k = e k , but there is no 
background theory behind this choice and it seems to be not optimal in practice. To 
improve the approximation properties of U k , we could choose v k and w k to maximize 
the norm of the orthogonal component u'. Since 

u' = (I-U k Uj)u=(Ax, (I-UtcUJ) x 2 V k T x 3 wT) x 2 v T k x 3 w J k , 

we are to find 

v k ,w k = arg max ||Bvw|| for B = A x } (I — U k ll£) x 2 x 3 W k . (2) 

N=UNH 

The global maximum is not required. We can satisfy with the sufficiently large u' 
that can be found by several iterations of the alternating least squares method (ALS, 
see [22] and Alg. 2), for which the local linear convergence is proved in the rank-(1 , 1,1) 
case [7, 38]. If the ALS converges to the best rank-one approximation, it also solves (2) 



5 



Algorithm 2: [22] ALS rank-(1 , 1 , 1 ) iteration 
Input: Tenvec subroutine for tensor A 

Output: Best rank-(1 , 1,1) approximation cru ® v ® w for tensor 
Initialization: Unit vectors v, w 
for k = 1 , . . . , p a i s do 
u := Avw; cr:=||u||; u:=u/||u|| 
v := Awu; cr:=||v||; v:=v/||v|| 
w := Auv; a:=||w||; w := w/||w|| 
end for 



as a dual problem. To see this, consider the approximation B = bu ® v ® w with unit 
u,v, w and compute the optimal 1x1x1 core b by (1) 

b := B X] u T x 2 v T x 3 w T = (B, u ® v ® w) = (Bvw, u). 

The approximation B that minimizes the approximation error 

|| B — B||p=||B||p — 2(B, bu<av ® w) + ||bu® v ®w|| F = ||B||p — |b| 2 

also maximizes |b| and solves (2), since 

max |b| = max max(Bvw, u) = max ||Bvw||. 

||u|| — ||v|| — ||vv|| — 1 ll^ll = l|w||=1 ||u||=1 l|v||=||w||=l 

Each ALS iteration requires 3 tenvecs with B, that can be rendered by 3 tenvecs 
with A and 0(nk) operations for the orthogonalization. The optimization of the MKR 
is summarized in terms of tenvecs operations in the Alg. 3. 

If p a i s ALS iterations are used to solve (2) on Step 6, then Alg. 3 requires (3 + 9p a i s )t 
tenvecs and 0(nr 2 p a i s ) additional operations. 

In numerical experiments (see section 5) we show that the optimized MKR Alg. 3 
shows the better convergence that the MKR Alg. 1. However, convergence estimates 
are still missing, even in the exact low-rank case. Is it possible to develop a method 
that has guaranteed convergence at least in the exact case? We start from the matrix 
case, using Wedderburn rank-one reduction formula. Then, in section 4, we propose 
generalization of Wedderburn method to tensor case and reintroduce the MKR and the 
optimized MKR as different versions of the same Wedderburn process, for which we 
discuss the convergence and give numerical examples and comparison. 

3 Matrix approximation using Wedderburn rank re- 
duction formula 

3.1 Preliminaries 

Many matrix decomposition algorithms can be represented as a sequence of rank-one 
Wedderburn updates [37]. For a matrix A and vectors x,u of appropriate sizes, such 
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Algorithm 3: Optimized minimal Krylov recursion for tensor approximation 
Input: Tenvec subroutine for tensor A, tolerance parameter tol 
Output: Mode subspaces U, V, W for approximation A^A = Gx 1 Ux 2 Vx 3 W 
Initialization: Unit vectors U] , V] 

l: w, := Au^i/HAuiVtH, U : = [ui],Vi = [vi],W| = [wi] 

2: updll = updV = updW = true; k = I = m = 1 

3: while updll or updV or updW do 

4: if updll then {Proceed with the new vector u if required} 
5: Define B = A x, (I-U k U£) x 2 x 3 

6: Find v k ,w k := arg max||<>|| =1) ||vv||=i || Bvw|| by p a i s ALS steps, see Alg. 2 
7: u:=A(Vi<hc)(W m i^); u' != (I-U^u 
8: if ||u'|| < tol||u|| then {Breakdown} 
9: Fix breakdown or set updll := false 

10: else 

11: u k+1 := u'/llu'H; U k+ i := [U k u k+1 ]; k:=k+l 

12: end if 
13: end if 

14: {Proceed with new vector v if required} 
15: {Proceed with new vector w if required} 
16: end while 

that x T Au ^ 0, the matrix 

Aux T A . . 

B = A ^— . 3 

x T Au v 

has rank B = rank A — 1 . For the rank-r matrix Ao = A after r updates of form 

A k = A^ - Ak -^ A kX ^ Ak - 1 (4) 

X^A k _!U k 

with cu k = f x^A k _iy k ^ 0, the matrix A T becomes zero and the rank-r decomposition of 
A can be constructed. 

In [5] the properties of the Wedderburn sequence (4) are studied in much detail. We 
recall a list of basic facts about (4) in the compact and 'more matrix' form. 

Statement 1. Matrix A k writes as A k = P^A = AQ k with Po = I, Qo = I and 

Pk = Pk-i - ^k 1 Pk-ix k u^A T P k _, , Q k = Q k _! - cu k 1 Qk-iyicxJ AQ k _i . (5) 

Corollary 1. P k x k = and Q k u k = 0. 

Statement 2. Matrices P k , Q k are projectors, i.e. V\ = P k and Q k = Q k . 
Obviously, P = Pq = I, Qo = Qo — I an d from P^_ 1 = P k _! it follows that 

Pk = Pti " u)fc 1 Pt 1 y k x£A T P lc _ 1 " cv^P^ylA 7 ? 2 ^ + 

+ cu k - 2 P k _ 1 x k ijjA T PtiXicyjA T P k _i = P k . 

V v ' 

W k 
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Statement 3. P k P m = P m P k = P max (m,k), and Q k Q m = QmQk = Qmax(m,k). 
Start from 

QkQk-i = Qk_i - cu k 1 Q k _iy k x[AQ^_ 1 = Q k _! - cu^Qit-iyit^AQ^ = Q k 
and complete the proof by induction. 

Corollary 2. For m ^ k it holds P k x m = P k P m x m = and Q k t) m = Q k Q m Um = 0. 

def def 

Statement 4. For biconjugate vectors u k = P k _ix k and v k = Q k tTJ^ it holds 

T * T * def 

u k Av k = x k A k _!ij k = cu k , 
since A k _! = AQ k _! = AQ£_, = A k _iQ k _! = P^AQi^. Also for m ^ k it holds 

u^Av k = x^P^_! AQ k _!y k = x^A max ( m _ 1)k _ 1} -y k = 0. 
With U k = f [ui ... u k ], V k = f [vi ... v k ] and I2 k = f diag(cui . . . cu k ) we conclude 

li£AV k = O k . (6) 
Corollary 3. For the valid Wedderburn process U k and V k have full rank. 
Corollary 4. For m ^ k it holds P k u m = P k P m -iX m = P k x m = and Q k v m = 0. 

Statement 5. Each rank elimination step (4) writes without multiplication by A k _i as 
follows 

A k = A k _, - = A - f AVp<u{A. 

Now the rank-k Wedderburn approximation reads 

A k = AV k O k 1 A, A = A k + A k . (7) 
For the rank-r matrix A the residual A r = and the approximation A r is exact. 

We see that each Wedderburn update (4) adds vector u k to the kernel and x k to 
the cokernel of the residual A k = A — A k . The approximation A k , as a linear operator, 
interpolates A on subspaces spanned by X k and Y k , exactly 

X k A k = X k A, A k Y k = AY k , U k A k = U k A, A k V k = AV k . (8) 

This can be associated with the Gaussian elimination, that gives an approximation 
exact on certain rows and columns of the matrix. In this respect we refer to the 
process (4) as to the Wedderburn elimination. 

In [5] it is shown how the proper choice of x k ,u k reduces the Wedderburn elimination 
to the well-known matrix decompositions such as LU, QR, cross factorizations and 
Lanczos bidiagonalization. We consider another principle for the selection of vectors 
x k ,u k at each step in the matrix case, that can be associated with Gaussian elimination 
with column or row pivoting. It produces a new method for matrix approximation, that 
is simply generalized to tensors while maintaining the convergence. 
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3.2 Pivoting in Wedderburn elimination 

The idea behind the proposed choice of x k ,y k is minimization of Frobenrus norm of 
the residual, that is important when we deal with the full-rank matrix of a low e-rank, 
i.e. that can be approximated by the low-rank matrix with the accuracy e. As shown 
in [5], the minimization of residual w.r.t. unit x^y^ gives k-th singular vectors of A, 
that are not known in advance. We propose another minimization strategy, based on 
the following theorem. 

Theorem 1. Consider Wedderburn step (3). Then 

for fixed y, x opt == arg min 

IWI=1 

for fixed x, y opt = f arg min 

Proof. Since the valid Wedderburn step does not depend on scaling of x, y, for the fixed 
y we can constrain x to satisfy x T Ay = 1 . Then 

x op t = min ||(Ay)(A T x) - All^ = min II (A T x)(Ay) T - A T ||p . 

x: x T Ay=l x: x T Ay=l 

The least squares problem solves by (A T x) opt = A T Ay/|| Ay || 2 and x opt = Ay/\\ Ay || 2 . 
After normalization we have (9), and (10) follows by substituting A = A T . □ 

Eqs. (9) and (10) show how to reach the fast decay of the residual in the Wedderburn 
process (4) either by choosing optimal x k for given y k or by choosing optimal y k for 
given x k . This can be associated with the column and row pivoting in the Gaussian 
elimination. Thus, we refer to the Wedderburn process (4) with arbitrary y k and optimal 
x k as to Wedderburn elimination with column pivoting (WCP) and to Wedderburn 
process with arbitrary x k and optimal y k as to Wedderburn elimination with row 
pivoting (WRP). The steps of WCP and WRP shortly read 

choose y k , set x k = ..^^ ,, , A k = (I - x k x k ) A k _r, (WCP) 

IIAic-iyicll 

A T x k 

choose x k , set y k = — A k = A k ^ (I - y k y[) . {WRP) 

ll A k-i x i<ll 

Theorem 2. For the Wedderburn elimination with column pivoting it holds 

1. X k = [x] . . .x k ] has orthonormal columns X^X k = I; 

2. P k = P£ = I — X k X£ is the projector on the subspace orthogonal to spanX k ; 

def 

3. biconjugate vectors u k = P k _ix k = x k . 

For the Wedderburn elimination with row pivoting it holds 



A 



A-yx 1 A 



x J Ay 



IIAyir 



(9) 



A 



Ayx' A 



x T Ay 



A'x 



(10) 
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Algorithm 4: Wedderburn elimination with column pivoting (WCP) 



Input: Matvec subroutine for the matrix A, tolerance parameter tol, accuracy e 
Output: Approximation A with the accuracy ||A — A||f < £ ||A||f 
Initialization: k = 0, Xo = [0], Bo = [0], nrm = 
1: repeat 

2: k := k + 1 , choose unit vector u k 
3: x:=Au k ; x' := (I-X^Xj.^x 
4: if ||x'|| < tol||x|| then {Breakdown} 

5: return A = X] C _iB^_ 1 {or repeat current iteration with another u k } 
6: else 

7: Xk := x'/llx'H 
8: end if 

9: b k := A T x k , err := ||b k ||, nrm 2 := nrm 2 + ||b k || 2 
10: X k := [X k _! xj, B k := [B k _! b k ] 
11: until err ^ enrm 
12: return A = X k B£ 



1. Y k = . . . y k ] has orthonormal columns Y^Yk = I; 

2. Q k = Qi = I- Yk Y k T ; 

3. v k = Q k -iu k = y k . 

Proof. Let us prove statements for WRP by induction. It is easy to check them for 
k = 1. Suppose they hold at step k— 1. Optimal choice of u k by (10) reads 

y = Al_^x k = QLiA T x k = (I - Y M Y^)A T x b u k = y/||y||. 

This shows ||y k || = 1, Y^tj^ = and the first statement follows for Y k := [Y k _i y k ]. By 
substituting y k = A k _iXic/|| A k _-|X k || in (5) we prove the second statement 

Klk = fT = Vk-1 1 t = 

= (I - YwYLi) (I - y^vl) = I - Y k _! Yj_i - y k yl = I - Y k Y k T . 
Finally, the last statement reads 

def n QLiA T x k Q k -i A T x k QL^V 

v k = Qk-iyk = Qk-i |, A T „ I, = N A T „ || = N A T Z7 1| = ^. 

ll^k-l^ll ll^k-l^ll ll /v k-1 x k|l 

Statements for the WCP are proven in the same way. □ 

Based on this theorem, we propose a Wedderburn elimination algorithm with the 
column pivoting (WCP, see Alg. 4). The approximation is sought in the form A k = X k B k 
with B k = A T X k , that follows from (8) and orthogonality of biconjugate vectors U k = X k , 
result of the Theorem 2. To explain the stopping criteria of Alg. 4, write 



nrm 



= II A k || F = IIXtcXIAIIf = || A T X k || F = ||B k || F , err d = ||A k - A k ^ || F = ||b k || F , 
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where the first part estimates the norm of matrix by the norm of approximation and 
the second estimates the error of approximation by the norm of the update. The 
breakdown can occur if new vector x has the neglectable component x' orthogonal to 
the accumulated subspace X^-i. This is regulated by the tolerance parameter tol, that 
could be chosen close to machine precision. We can try to fix the breakdown by another 
selection of ijk, or choose to terminate the algorithm. 

The WRP version of algorithm follows by substituting A = A T . 



3.3 Relation to SVD and Lanczos bidiagonalization 

On each step of the WCP Alg. 4 we perform the Wedderburn elimination, choosing 
optimal 'pivot' Xk for given ijk. It is also important to select proper 'leading vectors' ijk 
to obtain faster convergence of approximation and avoid breakdowns. We could think 
about the maximization of a)k = x^Ak-iijk = ||Aic-itJic|| = ||x'||. Solving exactly 

y k = argmax ||A k _iy|| = argmax II (I - X^X^Ayll , (11) 

lly||=i IMH 

we reduce the Wedderburn process to a sequence of best rank-one approximations with 
tu k an d Xk,ijk being singular values of A (sorted descending) and corresponding left and 
right singular vectors. Therefore, the WCP with selection of the leading vector by (11) 
is equivalent to the SVD and provides the best rank-r approximation in the Frobenius 
norm. This approach can be associated with the full pivoting in the Gaussian elimi- 
nation. Each maximization problem (11) can be accurately solved by power iterations 
using only matvec operations, but in matrix case this approach generally is considered 
as quite expensive and faster alternatives are used. 

One of them in the Lanczos bidiagonalization, see [12] and Alg. 5. It generates bases 
X k = [x] . . .xj and Y k = [y-\ . . ."yj such that X^X k = I, Y^Yc = I and the matrix X^AY k 
is bidiagonal. 

The following theorem shows that the Lanczos bidiagonalization is similar to the 
Alg. 4 if the leading vector is selected as follows 

_ Aj_iXk _ A T x k 

This value is well defined, since 

A^Xk = A T P k -ix k = A T (I - Xk-iX^Jxk = A T x k = b k> 

and HA^Xkll = err vanishes only when the stopping criteria err < enrm is met and 
the next leading vector Tjk+i is not required. 

Theorem 3. Vectors X {lnc} = [x^j lnc ', . . . , x^' 110 '] of the Lanczos process (Alg. 5) initialized 
by xo, coincide with vectors X^ = [x^ cp \ . . . ,x^ ] generated by WCP Alg. 4 that 
starts from y n = A T x and chooses ijk as proposed by (12), providing both algorithms 
do not meet breakdowns. 
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Algorithm 5: [12] Lanczos bidiagonalization 



Initialization: y = 0, |3o = 0> unit vector Xo 
for k = 1 , 2, . . . do 

y:=A T x k -i; y'^y-Pk-iUic-i, ^ '■= \\v% Uk : = Tj'/llu' 
x := Ay k ; x' := x — <x k x k , (3 k :=||x'||, x k := x'/Hx'H 

end for 



Proof. If no breakdowns are met, then 

spanX k wcp} = span{A yi , (AA T )A-y b . . . , (AA 1 )^ 1 AyO, 
spanX { k lnc} = span{(AA T )x , (AA T ) 2 x , . . . , (AA T ) k x }. 

If yi = A T xo then spanX k wcp} = spanX k nc} for every k. Since columns of X k wcp} are 
orthonormal (Theorem 2), as well as X k nc \ we conclude X^ ^ = X k nc ^. □ 

Thus, in terms of x k , the WCP gives the same result as the Lanczos bidiagonal- 
ization. However, the sequence y k in the WCP differs from the one of the Lanczos 
bidiagonalization. First, we note that 



y = A^Ait-iy* = A T P k _i P k _ 1 Ay k = A T P k _!Ay k = Q^_ 1 A T Ay k , y 



k+1 



and span Y k = span{yi, (A T A)yi, . . . , (A T A) k ^i}. Sequence y k is 'almost orthogonal', 
i.e. for m ^ k — 2 it holds 

^ Vm ~ || AT Ay^ || "° 

since Q k _2ym = by the Corollary 2. Also, for X k and Y k generated by the WCP, the 
matrix X k AY k is tridiagonal, i.e. 

^m A Uk = y T n+ iyk = 0, for m^{k-2,k-1,k}. 

Remark 1. For k ^ 3 the vector X k _jX = X k _ 1 Ay k on Step 3 of Alg. 4 has only two 
last nonzero components. Hence, the orthogonalization step can be simplified to 

x := Ay k , x' := (I — x k _ 2 x k _ 2 - Xk-i* k _i ) x - 

This allows the short recursion for the orthogonalization of x k in the WCP Alg. 4, 
as it is done in the Lanczos bidiagonalization Alg. 5. We summarize this version of 
the WCP in the Alg. 6. In the machine arithmetic the orthogonality can be violated 
by roundoff errors and the reorthogonalization is required. The convergence of the 
Lanczos bidiagonalization method is well studied and although this method can have 
breakdowns, it is considered to converge from 'almost every' initial vector [29, 13, 16]. 
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Algorithm 6: WCP with Lanczos-like selection of leading vector 
Input: Matvec subroutine for matrix A, tolerance parameter tol, accuracy £ 
Output: Approximation A with accuracy ||A — A||f < £||A||p 
Initialization: k = 0, Xo = x_i = 0, A = 0, nrm = 0, unit vector y n 
1: repeat 

2: k := k + 1 , x := Ay k ; x' := (I — x k _ 2 x k _ 2 ~ *ic-ix k -i )* 
3: if ||x'|| < tol||x|| then {Breakdown} 
4: return A = A k _] 
5: else 

6: Xy, := x'/\\x'\\ 

7: end if 

8: y k+1 := A T x k , err := ||u k+1 1|, nrm 2 := nrm 2 + ||y k+1 1| 2 

9: A k = A k _! + X k yl +] 
10: until err ^ £ nrm 
11: return A = A k 

4 Tensor approximation using Wedderburn rank reduc- 
tion 

4.1 Computing dominant mode subspaces by WCP algorithm 

We are ready to propose the extension of the Wedderburn elimination to the tensor 
case. For the Tucker approximation we are to approximate the dominant subspaces by 
U, V, W that contain much of information about mode vectors of tensor. In [6] this is 
done by SVD applied to the unfoldings of an ui x U2 x U3 tensor A = [cLyiJ . They are 
matrices 

A^ = [agj, A« = [agj, A® = [a®], a™ = ag> = a$ = a, k , (13) 

of size ni x U2U3, U2 x U1U3 and U3 x nin2, that consist of columns, rows and tube fibres 
of A, respectively. Left singular vectors after appropriate truncation give Tucker factors 
U, V, W, and the core is found by (1). Since the SVD is applied to tensors that are given 
as full array of elements, the computation costs 0(n 4 ) and can not is not possible for 
large tensors (even sparse or structured). We approximate dominant subspaces of mode 
vectors applying the Wedderburn elimination to the unfoldings. 

We aim for algorithms that compute the rank-f^ , r 2 , r 3 ) approximation of an n]Xn 2 x 
n3 tensor using 0(r 2 ) tenvecs and 0(n) additional operations. This is asymptotically 
equal to the cost of the MKR Alg. 1, if the core tensor is computed. With this restriction 
for Ui x n 2 n 3 unfolding A = A' 1 ' = [a^jj we can not compute x = Ay and y = A T x for 
arbitrary vectors y of size n2n3 and x of size ui, since these operations require 0(u 2 ) 
storage and n tenvecs. To develop the Krylov-type methods and stay within the linear 
complexity in mode size, we should use only those operations, that can be accomplished 



13 



Algorithm 7: Wedderburn elimination for dominant subspace computation 
Input: Tenvec subroutine for A, tolerance tol, accuracy e, maximum size r max . 
Output: Mode subspace U for Tucker approximation A such that ||A — A||f < £||A||f 
Initialization: X = [0], nrm = 0, k = 
1: repeat 

2: k := k + 1 , choose unit vectors u k , z k (see Section 4.2) 
3: x := Ay k z k ; x' := (I - Xk-iXj.Jx 
4: if ||x'|| < tol||x|| then {Breakdown} 

5: return U = X k _i {or repeat current iteration with another y k , z k } 
6: else 

7: x k := x'/llx'H; X k := [X k _! x k ] 
8: end if 

9: Choose unit z randomly 

10: for k = 1, . . . ,p pow do {Power iterations to approximate A X] xj cryz T } 
11: y:=(Ax lX T) z = Ax,x[x 3 z T = Azx k , == llu IU U == U/llu II 
12: z:= (A Xt xl) J \) = A x, x[ x 2 u T = Ax k y, o-:=||z||, z:=z/||z|| 
13: end for 

14: err := cr, nrm 2 := nrm 2 + err 2 
15: until err ^ £ nrm or k = r max 
16: return U = X k 



by small number of tenvecs. For instance, x = Ay is substituted by 

x = A{y cs z) = A x 2 y 1 x 3 z T = Ayz, 

that means that we will use only those 'long vectors' y that are the tensor product y ®z 
of some y of size u 2 and z of size n 3 . Evaluation of y = A T x = A x-\ x is also infeasible 
and we substitute it with the approximation 

y « z ~ A X] x, 

that means that we develop the algorithms that give accurate tensor approximation, 
using only certain rank-one approximation of A x ! x instead of the precise result. 

In the matrix case the difference between column and row pivoting in the Wedder- 
burn elimination is not significant, since the properties of column basis in the WCP coin- 
cide with ones of row basis in the WRP and vice versa. In the tensor case with imposed 
restriction the multiplication A{y ®z) is accurate, but the multiplication A T x = Ax]X T 
is always done approximately with an error of truncation to rank one. Considering this 
difference, we use the WCP algorithm for the tensor case, since good properties of X k 
given by Theorem 2 persist here. Using the direct analogy with the WCP Alg. 4 for 
matrices, we propose a method to derive dominant mode-1 subspace U, see Alg. 7. 

Each iteration begins with the choice of the 'long' leading vector in the form u k ®z k . 
It can be done arbitrarily, and we propose some good strategies in the section 4.2. The 
new direction x and basis vector x k appear exactly like in the matrix case. To terminate 
the method, we can use one or both of the stopping criteria: 
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• fix maximum number of iterations, i.e. desired size of basis U by r max , 

• find basis U that allows approximation of A with relative accuracy e. 

In the WCP Alg. 4 the error was estimated by the norm of vector b k = A T x^. In the 
tensor algorithm we should estimate the Frobenius norm of the matrix A x j x k instead, 
but it can not be evaluated by a small number of tenvecs and we substitute it by the 

def 

spectral norm err = || A x i X\c\\2- To estimate this, we use the power method for n 2 x n 3 
matrix A Xi xj. It is initialized by some randomly chosen unit vector z of size n 3 . In the 
power method, a converges to the maximum singular value of matrix = A X] xj and 
B k = cryz T converges to the best rank-one approximation of B k . Since the high precision 
is not necessary for the error estimation, we can satisfy with the fixed small number 
Ppow of power iteration steps. If only the fixed-rank stopping criteria is desired, power 
iterations on Steps 9-13 of Alg. 7 can be omitted. 

As well as in the matrix case, we can meet with the breakdown if new vector x 
have almost zero (neglectable in the machine precision) component x' orthogonal to the 
subspace X k _i . We can try to fix it by repeating the current step with another selection 
of Z\, or choose to terminate the algorithm. 

Mode-2 and mode-3 bases V and W can be computed by the same algorithm after 
obvious permutation of modes. Directly from the Statement 5 in matrix case we derive 
the following theorem, that proves the convergence of tensor methods based on Alg. 7 
for the exact-rank case. 

Theorem 4. For the tensor A with mode sizes Ui,ri2,n3 and ranks ri,T2,T3, three 
applications of Alg. 7 return bases U, V and W of sizes x ri, n 2 x r 2 and n 3 x r 3 
that allow the Tucker decomposition A = Gx 1 Ux 2 Vx 3 W with core G given by (1), 
providing that computations are not terminated by breakdowns. 

Remark 2. The convergence in the exact-rank case is guaranteed for any choice of the 
leading vectors and z\ that does not lead to the breakdown. 

In the section 4.2 we propose a number of strategies for the selection of the leading 
vectors, that are based on the different optimization ideas and lead to the methods 
with different complexity and convergence properties. However, the convergence in 
the exact-rank case, which we consider as the necessary requirement, persists for all 
methods based on the Alg. 7. 

Remark 3. The minimal Krylov recursion Alg. 1 can be also considered as variant of 
Alg. 7 with the special choice of leading vectors. 

To achieve the reduction of mode-1 rank after the first iteration of the Alg. 7 we 
should approximate the tensor by 

A ~ Ai = Xi « B] with Bt=AxiXi. (14) 

This does not comply the restrictions imposed to reach the linear complexity. Therefore, 
we consider Alg. 7 as a method to generate an approximation of the dominant mode 
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subspaces that is guaranteed to converge in r steps for the tensor with the mode rank 
r, but that does not reduce the mode ranks by one on each iteration. 

Remark 4. The mode ranks of the rank-fr] , r 3 ) tensor in general can not be reduced 
by the elimination of the rank-(1 , 1,1) approximation. 

To take use of the approximation (14), we can further approximate B] by low-rank 
(or even rank-one) format. To do this by tenvec operations, we can use 'internal' 
Wedderburn elimination steps for This approach directly follows the basic idea 
of [25], where the cross approximation method is generalized to 3-tensors. Vectors x k 
can be associated with the mode fibres selected in the tensor, and B k with the slices, 
for which the internal cross approximation scheme is used. As well as in the matrix 
case, we emphasize the very important link between the Krylov subspaces approach and 
cross approximation methods, that is established by the Wedderburn framework. 

In the following we restrict the discussion to only rank-one approximation of B k and 
pay by losing the mode rank reduction property, but result in the methods that are 
more efficient since they have 'symmetric' behaviour in respect to different modes. 

4.2 Selecting leading vector 

By the Theorem 4, in Alg. 7 the every choice of the leading vectors u k , z k that does not 
lead to the breakdown, ensures the convergence in the exact-rank case. However the 
different choices of leading vectors result in the approximations with different accuracy 
until the exact representation is found. Therefore, we should choose the leading vectors 
in a clever way, that ensures the fast convergence to the dominant subspaces and is com- 
putationally feasible. In the following we propose four strategies that lead to different 
maximization problems and result in different complexity estimates and convergence 
properties. 

4.2.1 SVD-like strategy (Wsvd) 

We can apply Alg. 7 three times and compute U, V, W in the completely independent 
processes, even using three processors on a distributed memory system. In algorithm 
for U = X k the best way to keep from breakdowns is to choose y k o z k that maximizes 
the orthogonal component ||x'||. 

y k ,z k = arg max II (I - X^X^HAuz) 

IMHMH 

B = A Xl (I-X^X^i). 

This is the direct analogy with the SVD approach (11) in the matrix case. The difference 
is that resulted x k and u k ® z k are not singular vectors of the unfolding A = A' 1 ', since 
the maximization is done w.r.t. 'long' right vectors with tensor product structure. This 
is the reason why the best tensor rank-(r, r, r) approximation can not be computed by 
the simultaneous elimination of the best rank-(1, 1,1) approximations. 



arg max ||Byz|| , 

IMHMH ' ( 15 ) 
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Nevertheless with this choice we are safe from breakdowns. Zero-valued orthogonal 
component ||x'|| appears only if max|| y ||=|| z ||=i ||Bijz|| = and hence B = 0, which means 
that the exact representation = f A X] (X^X^) with mode-1 rank ri = k is computed 
for A. The following theorem shows that 'machine precision breakdown' also happens 
only when the approximation is of the machine precision accuracy. 

Theorem 5. If the breakdown ||x'|| < tol||x|| is met on the step k + 1 of the Alg. 7, 
than the computed subspace Xk provides the approximation Ak = A x , (X^X^) with the 
Tucker rank r-\ = k and accuracy ||A — A k ||2 < "t oX || A || 2 in spectral norm. 

Proof. On the step k + 1 we choose yk+i,Zk+i = arg max|| y || = || z |j = i ||Buz||. The norm of 
the orthogonal component reads 

llx'11 =||Buk+iZk+i II = rnax ||Buz|| 

IMHMH 

def 

= max max(Byz, x) = max (B, x®u®z) = ||B|| 2 . 

llvlHMH NH NHIylHMH 

Also for x = Aijk+iZk+i it holds 

def 

||x|| = HAijic+iZk+i II — max(Aijk+iZk+i,x) ^ max (A, x®y ®z) — ||A||2. 

IMH NHMHMH 

Now the breakdown criteria ||x'|| < tol||x|| gives ||B||2 < "tol || A || 2. Finally we write 

B = A Xl (I-X k Xj)=A-Ax 1 (XkX£)=A-A k , 
that completes the proof since mode-1 rank of A k is equal to ri = rankX k = k. □ 

Corollary 5. Alg. 7 with the SVD-like strategy (15) applied to tensor A with mode 
ranks r],^,^ computes bases U, V, W that allow the exact representation A = G X] 
U X2 V x 3 W after ri, X2 and r 3 iterations, respectively. 

Note that since Xk = [Xk-i xj is orthogonal, it holds 
Ax,x[ = A Xl ((I-X k _ 1 Xj_i)x k ) T = (Ax, (I-X^XjU)) x ] %l = Bx ] xl 

The maximization problem (15) can be solved by the ALS Alg. 2 applied for B. In this 
case power iterations for matrix A x, on Step 9-13 of Alg. 7 can be omitted, since 
the error estimate err = ||A x, x k || 2 = ||B x, x k || 2 is actually computed in the ALS 
iterations. 

On the step k of the Wedderburn method inner ALS iterations cost 3p a i s tenvecs 
and 0(p a i s nk) operations for the orthogonalization. This summarizes to 3p a i s r tenvecs 
and 0(p a i s nr 2 ) additional operations for one dominant subspace. 
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4.2.2 Lanczos-like strategy (Wine) 

We can also use the analogy with the Lanczos choice (12) by taking unit u k ® z k ~ 
A X] x k /||A Xi x[\\. This leads to the dual maximization problem 

y k ,z k = arg max II A x, x[ x 2 y J x 3 z T 

NHMH 

B = A Xi x k . 

The solution can be accomplished by p pow steps of the power iteration method applied 
to matrix B. On the step k of the Wedderburn method it costs 2p pow tenvecs. Note 
that the maximization problem (16) is actually solved on Steps 9-13 of Alg. 7 by power 
iterations that computes best rank-1 approximation of A Xi x[ to estimate the norm of 
the residual. Thus, the Wine pivoting strategy requires only to set u k := y and z k := z 
after power iterations as new vectors of the Wedderburn process. 

4.2.3 Restricted SVD-like strategy (WsvdR) 

Three Wedderburn elimination algorithms can be used to extend all mode bases si- 
multaneously. Suppose k — 1 , 1 and m steps were done to compute mode subspaces 
Xic_i , Yi, Z m . Then on the step k of the Wedderburn process for the mode-1 subspace 
we can make use of V = Yi and W = Z m by restricting the maximization (15) to the 
tensor product of these subspaces. Therefore, we take y k = Y l y k , z k = Z m z k and solve 

y k ,£ k = arg max ||(I - X^X^) {k[Yiy){Z m z)) 

B = A x , (I - Xic-iX^ ) x 2 Y L T x 3 Z T m . 

This maximization problems exactly matches the (2)! 

Remark 5. The optimized MKR is a variant of the Wedderburn process for tensors 
with restricted SVD-like strategy of pivoting. 

The WsvdR approach can lead to the slow convergence or breakdowns at the first 
iterations, but when X k> Y;, Z m become larger, chances to meet the breakdown vanish, 
since (17) becomes close to the unrestricted maximization (15). Numerical experiments 
provided in the section 5.2 show that sometimes the restricted SVD strategy gives even 
better results than unrestricted SVD strategy. The reason is probably that with this 
restrictions the ALS iterations are not caught in the local minima. 

Complexity on the step k for the mode-1 subspace is 3p a i s tenvecs and 0(p a i s uk) 
operations for the orthogonalization, total complexity is 9p a i s r tenvecs and 0(p a i s nr 2 ) 
additional operations. 

4.2.4 Restricted Lanczos-like strategy (WlncR) 



arg max u Bz 

NHMH " 1 



(16) 



arg max ||Byz|| 

llflll=PIH (17) 
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Algorithm 8: Wedderburn with restricted Lanczos-like pivoting strategy (WlncR) 



Input: Tenvec subroutine for the tensor A, tolerance tol, accuracy £ 
Output: Approximation A = Gx 1 Ux 2 Vx 3 W such that ||A — A|| F < ||A||f 
Initialization: Unit vectors u, v,w, updX = updY = updZ = true, 
l: *i = Avw/||Avw||; tji = Awit/||Awu||; zi = Auv/||Auv||, k = I = m = 1 
2: Xi = [xi],Y| = [-yi],Zi = [zi], G = A Xt x{ x 2 y| x 3 z}; nrm = ||G|| F 
3: while updX or updY or updZ do 



x: {Proceed with new vector x if required} 

4: if updll then 

5: For B = G(k, :,:) solve B «: B = by k £ k {best rank-one approximation} 

6: y k :=Y v y k , z k := Z m £ k ; x := Au k z k ; x' := (I — X k X£)x 

7: if ||x'|| < tol||x|| then {Breakdown} 

8: updX := false {or repeat current iteration step with another u k , z k } 

9: else 

10: x k+ i :=x7||x'||; X k+1 := [X k x k+1 ] 

11: Enlarge G by G(k + 1 ,:,:):= A x , x£ +1 x 2 Yj x 3 Z T m 

12: err := ||G(k+ 1,:, :)||f, nrm 2 := nrm 2 + err 2 , k:=k+l 

13: if err < £ nrm then {Convergence} 

14: updX := false 

15: end if 

16: end if 

17: end if 

y: {Proceed with new vector y if required} 

z: {Proceed with new vector z if required} 



18: end while 

19: return U = X k , V = Y u W = Z m , A = Gx 1 Ux 2 Vx 3 W 



Finally, we combine the Lanczos-like selection of leading vectors (16) and the re- 
stricted maximization. 

y k ,z k = arg max ||(Y l y) T (A x k )(Z m z)|| = arg max ||y T Bz|| 

Hy||=llzlH llvlhPIH 

B = Ax,xJx 2 Y[ x 3 Z^, y k = Y t y k , z k = Z m z k . 

If only bases U, V, W for the approximation are required, the maximization can be 
accomplished by p pow steps of power iteration method applied to the matrix B. At 
every step of the Wedderburn method it requires 2p pow tenvecs. But if the core tensor 
is also desired, the efficiency can be highly improved by precomputing Baslxm matrix. 
Comparing (18) and (1) we note that B is exactly the last mode-1 slice from the optimal 
core for the approximation of A in bases U = X k , V = Yi, W = Z m 

G(k, :, :) = A x , (U(:, k)) T x 2 V T x 3 W T = A x , x£ x 2 Y, T x 3 Z J m = B. 

Thus if we need the core G, we prefer to compute it slice-by-slice in the Wedderburn 
process. Then we can apply standard matrix tools (SVD or cross methods) to find the 
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Table 1. Complexity of the algorithms for rank-(r 1 ,T2,r 3 ) tensor approximation 



name 


description 


output 


tenvecs 


1\/TK"R 
IVliVix 


1V11I1. lYiyiCJV IcLUIolCJIl IOUJ r\lg. 1 


1 1 V w 

Li, V, vv 


O I 


Wsvd 


Alg. 7 with strategy (15) 


u,v,w 


9p ai s r + 3r 


Wine 


Alg. 7 with strategy (16) 


u,v,w 


6p P owT + 3r 


WsvdR 


Alg. 7 with strategy (17), Alg. 3 


u,v,w 


9p a i s r + 3r 


WlncR 


Alg. 7 with strategy (18), Alg. 8 


u,v,w 


6p P owT + 3r 


WlncR 


-//- 


U, V, W, G 


r 2 + 3r 



best rank-one approximation of B and solve (18) without additional tensor operations. 
This version of the Wedderburn elimination algorithm for tensors is summarized in the 
Alg. 8. 

4.3 Comparison of the algorithms 

To compare the versions of the proposed algorithm, we give their complexities in Table 1. 
If only subspaces U, V, W for Tucker rank-(ri , T2, r^) approximation are required, than 
the minimal Krylov recursion [30] is fastest in terms of number of tenvecs used, since it 
requires only 3r tensor operations. All versions of the Wedderburn elimination Alg. 7 
also require 0(r) tenvecs, with factor depending only on the selected number of ALS or 
power iterations. Note that for p a i s = Pp OW) Lanczos-like pivoting strategy and SVD-like 
pivoting take roughly the same time, but SVD-like pivoting is guaranteed to be free 
from breakdowns. This differ tensor algorithms from the matrix case. 

If the core G for Tucker approximation is required, we generally need additional 
r 2 tenvecs to evaluate it by (1). In this case, the fastest version of the Wedderburn 
elimination algorithm is the WlncR Alg. 8. It uses exactly the same number of tenvecs 
as the MKR, but shows better convergence in numerical experiments. 

5 Numerical examples 

The numerical experiments presented in this section were performed on the Intel Xeon 
Quad-Core E5504 CPU running at 2.00 GHz. In the section 5.1 we use MATLAB version 
7.7.0, in sections 5.2, 5.3 we use Intel Fortran compiler version 11.1 and BLAS/LAPACK 
routines provided by the MKL library. 

5.1 Sparse tensors 

The tensor decomposition of a sparse large-dimensional arrays is an important tool 
in the network analysis, that is widely used now in the information science, sociology 
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Figure 1. Approximation accuracy for the Caltech Facebook graph 




We approximate the 597 x 597 x 8 2 tensor using three algorithms and plot relative 
accuracy in the Frobenius norm for different values of mode ranks. TALS is the 
Tucker- ALS, MKR is the minimal Krylov recursion, WlncR is the Wedderburn 
elimination with the Lanczos-like restricted pivoting. 



and many other disciplines. We consider one example from [34], where a very nice 
introduction to network analysis is given. 

The example is the graph of the Facebook social network from Caltech University, 
coupled with the dormitory information. The graph represents relations between n = 
597 people living in m = 8 dorms by sparse n x n x m x m 4-tensor A. Every entry 
aypq = 1 means that the person number i lives in the dorm number p and links to the 
person number j, who lives in the dorm q. There are 25646 unit elements in A = [awpq], 
and all others are zeroes. Since the discussed methods apply to 3-tensors, we join 
dorm indices p and q in multi-index and consider A = [a.y, p q] as the 3-tensor of size 
nxnxm 2 . The relative accuracy of the approximation in the Frobenius norm is shown on 
Fig. 1. We compare three methods: the Tucker-ALS [22, 7] (MATLAB implementation 
from Tensor Toolbox [1]), the minimal Krylov recursion (Alg. 1) and the Wedderburn 
elimination method with Lanczos-like restricted pivoting (Alg. 8). We note that the 
WlncR Alg. 1 converges slowly on the first iterations, when dimensions of accumulated 
bases are small, imposing serious restrictions in the maximization (18). However, for 
the larger values of ranks, the WlncR becomes more accurate than the MKR method. 

The Tucker-ALS is the most accurate but computationally demanding method. On 
each Tucker-ALS iteration the Tucker factors are subsequently updated as follows. With 
two factors fixed, for example V and W, we compute the n-\ x r 2 x r 3 tensor 

B = Ax 2 V T x 3 W T , B = B {1) (19) 



21 



Then for the rii x vzTt, unfolding B = B' 1 ' we find the approximation B ps B =: UG with 
the n] x r-\ orthogonal matrix U that is a new Tucker factor and the r-\ x r2r3 matrix 
G that is then reshaped to new T] x T2 x r 3 core tensor G. The rank-ri approximation 
of B can be computed by the SVD or cross approximation methods (c.f. [36]). The 
evaluation of (19) requires r2r3 tenvecs, which results in 3r 2 tenvecs on each iteration, 
much more than the complexity of the MKR and the WlncR. 

We do not provide timings in this section, because MATLAB-based computations 
are usually far from being highly optimized. Some timings will appear in the next 
sections for the Fortran implementation of discussed algorithms. 

This example was introduced to us by Prof. Lars Elden. In [31] more experiments 
with the approximation of sparse tensors by different algorithms are provided and re- 
sults are compared for the truncated HOSVD, the minimal Krylov recursion, several 
modifications of the MKR and the WlncR Alg. 8 that was implemented by authors 
of [31] with minor modifications. 

5.2 Compression from canonical to Tucker format 

Multidimensional data often appear in modern modelling programs in canonical form 
(C). For example, in chemical modelling programs, e.g. PC GAMESS and MOLPRO, 
the electron density function is given as a sum of the tensor product of one-dimensional 
Gaussians. However, even for simple molecules, the number of terms in the decompo- 
sition obtained by MOLPRO may be too large for practically feasible computations. 
In order to make computations efficient, the further approximation (recompression) to 
the Tucker format can be performed. The accuracy of the desired approximation can 
vary in different applications. For some quantum chemistry problems the very precise 
approximation (with ten or more significant digits) is required. 

Such recompression was done in [4] using the Tucker-ALS algorithm [22, 7], in [20] 
by the Tucker-ALS with the initial guess obtained from the coarser grids, in [11] by the 
Cross3D algorithm [25], in [26] by the individual cross approximation of canonical factors 
and in [32] by the cross approximation of Gram matrices of unfoldings. For an n x n x n 
tensor given in canonical form with R terms, each tenvec costs 3uR operations, and 
the proposed versions of Alg. 7 can be applied to compute the Tucker approximation 
efficiently, even for large n and R. 

We apply the discussed algorithm for the Tucker approximation of the electron den- 
sity of some simple molecules, discretized on the uniform n x n x n tensor grid with 
n = 5121. The convergence of algorithms, i.e. the accuracy of rank-(r, r, r) approxima- 
tion for different r is shown on Figs. 2,3. On each graph we compare the accuracy of the 
approximation computed by the Tucker-ALS [22, 7] (thin solid line) with the accuracy 
of the certain approximation method. For each method, the internal estimate of the 
error err is shown by thin dashed line and real accuracy ||A — A||f of the algorithm is 
shown by the thick dashed line. The core of the Tucker approximation was computed 
by (1). Since the evaluation of the whole n x n x n array for n = 5121 requires one 
terabyte of memory and a lot of computational resources, we verify the accuracy of 
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Figure 2. Approximation accuracy of the methane electron density 




For 5121 x 5121 x 5121 tensor given in the canonical form, with rank 1334, we compute the 
Tucker approximation by algorithms, listed in Tab. 1, and plot the relative accuracy in 
the Frobenius norm for different values of mode ranks. On each graph the thick dashed 
line shows the true error of approximation, the thin dashed line shows the estimate of 
error and the thin solid line shows the accuracy of the HOSVD approximation for the 

reference. 
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Figure 3. Approximation accuracy of the glycine electron density 





For 5121 x 5121 x 5121 tensor given in the canonical form, with rank 9208, we compute the 
Tucker approximation by algorithms, listed in Tab. 1, and plot the relative accuracy in 
the Frobenius norm for different values of mode ranks. On each graph the thick dashed 
line shows the true error of approximation, the thin dashed line shows the estimate of 
error and the thin solid line shows the accuracy of the HOSVD approximation for the 

reference. 
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Table 2. Time for approximation of electron density 
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85 


69 


37 
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24 
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57 


13 


33 
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43 
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For 5121 x 5121 x 5121 tensor given in the canonical form with rank R, we compute the 
Tucker approximation with different relative accuracy bound by different algorithms and 

show time in seconds. Algorithms MKR, Wsvd, Wine, WsvdR, WlncR are listed in 
Tab. 1, 'cross' algorithm is based on individual cross approximation of canonical 

factors [26], 'TALS(l)' is time for one iteration of the Tucker- ALS method [22, 7] 



algorithms by comparison of the result with the Tucker approximation computed by 
the individual cross approximation of canonical factors [26] with the accuracy set to 
e = 10~ 12 . The verification of the latter was done in [26] by the exhaustive verification 
on cluster platforms. The residual between two tensors in the Tucker format is computed 
as proposed in [27]. 

We note the slow convergence of the MKR for the methane electron density and 
the breakdown for glycine. All versions of the Wedderburn elimination converge much 
better, and for the larger glycine molecule the convergence is even more regular, than 
for methane. Accuracy of methods with the Lanczos-like pivoting is close to the optimal 
one, and accuracy of method with the restricted SVD-like pivoting is almost equal to 
optimal, except for first steps of the process, when accumulated subspaces are small 
and impose significant restrictions on the selection of leading vectors. Also, the internal 
error value, estimated by norm of A Xi%£ (and the same for other modes), is less regular 
than the real accuracy, that decays monotonically. In the methods with unrestricted 
pivoting (see Wine) the internal error value is computed by the spectral norm of matrix 
A X] and appears to be 'more optimistic' than the real error, that is measured in 
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the Frobenius norm. In methods with the restricted pivoting the Frobenms norm of 
A X] xL_ ! x 2 x 3 ZJ m is used to estimate the internal error and matches the real error 
more closely. 

Timings for the approximation for different molecules and accuracy parameters are 
given in Tab. 2. We compare all methods listed in Tab. 1 and the method proposed 
in [26] based on incomplete cross approximation [36] of unfoldings. For Wedderburn 
elimination methods we set p ais = p pow = 3. From the preliminary experiments we 
see that this rather small number of 'inner' iterations is sufficient; the experiments 
with p a i s and p pow set to 10, 30 and 300 show almost the same convergence and final 
accuracy for all the molecules. For the reference we also provide time for one iteration 
of the Tucker-ALS method [22, 7], that uses output of Alg. 8 as an initial guess. In our 
implementation of the Tucker-ALS the low-rank decomposition of (19) is computed by 
the cross approximation, that is usually several (2420) times faster than the SVD-based 
computations. However, even then one iteration of the Tucker-ALS seems to be quite 
expensive. In practical computations several (usually 3 4- 20) iterations of the Tucker- 
ALS are required depending on the accuracy of the initial guess; it is also necessary 
to estimate the ranks of the desired approximation before starting the Tucker-ALS. 
In section 5.1 we explain that each Tucker-ALS iteration requires 3r 2 tenvecs, while 
Wedderburn methods require (r) tenvecs to compute Tucker factors U, V, W and r 2 
tenvecs to generate the core G (see Tab. 1). Therefore, for large ranks and fixed p pow and 
Pais) Wedderburn methods become more efficient in comparison with a single iteration 
of the Tucker-ALS. We can note this behaviour in Tab. 2 on the lines corresponding 
to the high precision of approximation, where the ranks of the approximation are also 
high. 

We note that for this example the Wedderburn elimination with the restricted 
Lanczos-like pivoting (Alg. 8) is faster than all other versions of the Wedderburn elim- 
ination method. We also note that it outperforms the minimal Krylov recursion, since 
Alg. 1 converges slowly (or even does not converge) and uses more iterations to reach the 
same accuracy level. WlncR also outperforms method based on cross approximation 
of unfoldings [26]. We conclude that for this problem Alg. 8 outperforms previously 
proposed methods. 

5.3 Recompression in operations with structured tensors 

The efficient operations with matrices and vectors in compressed tensor formats is cru- 
cial in the construction of efficient iterative methods for solving equations and eigen- 
problems in three and more dimensions. The approach to such highly-efficient tensor 
linear algebra subroutines was discussed in [27, 33], where it is shown that the efficient 
evaluation of all basic linear algebra subroutines with tensor-structured data is based on 
the fast recompression of certain structured tensor. As an example, consider Hadamard 
(elementwise) multiplication between rii x n.2 x n 3 tensors 

A = G X, U (A) x 2 V (A) x 3 W [A) , B = Hx, U (B) x 2 V (B) x 3 W (B) . 
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Let mode ranks of A be ri,T2,T3 and mode ranks of B be pi,p2,p3. The result reads 

C = Fx 1 Ux 2 Vx 3 W (20) 

def 

with piTi x p 2 r 2 x p 3 r 3 core F = Kron(G,H) and non- orthogonal factors U, V, W of 
sizes n.! x pi^, n 2 x p 2 r 2 and n 3 x p 3 r 3 , respectively. Formally this is again the Tucker 
format with the core and factors given by 

F(ap,bq,cs) =G(p,q,s)H(a,b,c), U(i, ap) = Ut A Hi,p)U( B »(i, a), 

and so on for V, W. The mode ranks of C are products of correspondent mode ranks of 
A and B, and recompression is required to reduce the storage size. 

In [33] the fast recompression method based on the individual filtering of factors 
was proposed. Numerical examples in [33] include evaluation of the Hartree potential 
for the electron density of molecules, discussed in section 5.2. This problem writes as a 
multiplication between three-level matrix given in the canonical format (with diagonal 
core tensor) and three-dimensional vector of the electron density given in the Tucker 
format. The computation requires 10 4- 120 seconds depending on the complexity of 
molecule and desired accuracy of evaluation. However, examples of Tucker to Tucker 
multiplication were not presented in [33], since this operation appears to be sufficiently 
more expensive. For the large molecules it requires up to an hour, that we consider not 
affordable. 

We show that fast and accurate multiplication between tensors given in the Tucker 
format can be done using Wedderburn-based methods. Each tenvec operation with 
tensor (20) can be done in 0(q 4 + nq 2 ), where q = max(p,r), that is fast enough 
even for n up to hundred thousands and p,r up to several hundreds. We apply the 
discussed algorithm for the Hadamard multiplication of the discretized electron density 
of simple molecules to themselves. This operation can be a building block for algorithms 
that compute pointwise nonlinear functions of large tensors with linear in mode size 
complexity. One of the important applications is the cubic root of the electron density 
that appear in the Kohn-Sham model. A good initial guess for such methods can be 
evaluated by the mimic algorithm [26]. 

The convergence of algorithms, i.e. the accuracy of rank-(r, r, r) approximation 
for different r is shown on Figs. 4, 5. On each graph we compare accuracy of the 
approximation computed by the Tucker-ALS [22, 7] (thin solid line) with the accuracy 
of certain approximation method. The real accuracy || A— A|| T was verified by comparing 
the result with the Tucker approximation computed by the Cross3D algorithm [25] 
with the accuracy set to e = 10~ 12 . The Cross3D algorithm was verified in [25, 11] 
by exhaustive check on cluster platforms. The residual between two Tucker formats is 
computed as proposed in [27]. 

As well as for recompression of electron density from canonical form, in this problem 
all versions of the Wedderburn elimination demonstrate good convergence, and for larger 
glycine molecule it is even more regular, than for methane. Comparing the different 
versions of Alg. 7, we note the same behaviour that is already described in section 5.2. 
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Figure 4. Approximation accuracy of the Hadamard square of the methane electron 
density 
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For the Hadamard square of the 5121 x 5121 x 5121 tensor given in the Tucker format 
with mode ranks (74,74,74), we compute Tucker approximation by algorithms, listed in 

Tab. 1, and plot the relative accuracy in the Frobenius norm for different values of mode 
ranks. On each graph the thick dashed line shows the true error of approximation, the 

thin dashed line shows the estimate of error and the thin solid line shows the accuracy of 

the HOSVD approximation for the reference. 
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Figure 5. Approximation accuracy of the Hadamard square of the methane electron 
density 




For the Hadamard square of the 5121 x 5121 x 5121 tensor given in the Tucker format 
with mode ranks (62,176,186), we compute Tucker approximation by algorithms, listed in 
Tab. 1, and plot the relative accuracy in the Frobenius norm for different values of mode 

ranks. On each graph the thick dashed line shows the true error of approximation, the 
thin dashed line shows the estimate of error and the thin solid line shows the accuracy of 

the HOSVD approximation for the reference. 
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Table 3. Time for approximation of electron density 
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For #ie Hadamard square of the 5121 x 5121 x 5121 tensor given in Tucker form, we 
compute Tucker approximation with different relative accuracy bound by different 
algorithms and show time in seconds. Algorithms Wsvd, Wine, WsvdR, WlncR are listed 
in Tab. 1, 'TALS(l)' is time for one iteration of Tucker- ALS method [22, 7] 

Timings for the approximate computation of the Tucker-to- Tucker multiplication for 
different methods and accuracy parameters are given in Tab. 3. We compare all methods 
listed in Tab. 1 and provide time for one iteration of the Tucker- ALS method [22, 7], 
that uses output of Alg. 8 as an initial guess. For the Wedderburn elimination methods 
we set p a i s = Ppow = 3. Again, we note that for this example the WlncR Alg. 8 is faster 
than all other versions of Wedderburn elimination algorithms and also than one iteration 
of Tucker-ALS method. In this case the timings of different versions of Wedderburn 
elimination algorithms are more similar, because the cost of the evaluation of the core 
(r 2 tenvecs) dominates over 0(nr 2 ) time for additional operations. Nevertheless, we 
conclude that for this problem Alg. 8 can be method of choice for fast approximate 
evaluation of operations with data in tensor formats. 

6 Conclusion and further work 

We presented the family of algorithms for the approximation of large three-dimensional 
tensor that access it only using the tenvec operation (tensor-by-vector-by-vector mul- 
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tiplication). Our approach is based on the Wedderburn rank-reduction formula and 
admits different strategies to select vectors of the Wedderburn elimination ('pivoting') 
that lead to algorithms with rather different convergence and complexity estimates. The 
fastest algorithm from presented family, namely WlncR Alg. 8, may converge slowly or 
stagnate on the first steps of iterations. However, one can propose an efficient algo- 
rithm combining more efficient but demanding pivoting strategy on the first steps (for 
example, Wsvd, that is free from breakdowns) with the fast WlncR pivoting strategy 
on the next steps of algorithm. 

The presented methods can be applied for the approximation of dominant subspaces 
of large structured tensors. In the provided numerical examples some of the proposed 
algorithms are much faster than the Tucker-ALS [22, 7] algorithm and as fast as the 
minimal Krylov recursion [30], but more accurate in certain cases. The proposed meth- 
ods can be directly applied for the fast computation of bilinear operations between 
structured tensors, but the efficiency can be further improved by combining them with 
the individual factor filtering proposed in [33]. 

Canonical and Tucker formats can be straightforwardly generalized to d dimensions, 
however each of them has serious drawbacks, and another decomposition should be used 
for high dimensions, for example recently introduced tensor-train (TT decomposition, 
see [28, 23, 24]), that are based on the SVD techniques, but is free from the curse 
of dimensionality. Therefore, it is very natural to extend the proposed ideas to the 
TT format. In the sequel we will show how to construct algorithm for TT format 
that use tensor through tensor-by-vectors multiplication, i.e. Krylov-type methods in 
d dimensions. 

It is also important to analyse how does the choice of parameters p pow an d p a i s 
change the convergence properties and final accuracy of the proposed methods. The 
preliminary experiments show that for the experiments provided in this paper it is 
sufficient to take p a i s = p pow = 3. The deeper theoretical analysis is required, that 
should generalize the theory of Arnoldi methods for the tensor case. This is a topic of 
a forthcoming work. 
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